I mean—I hear what you’re saying, and I don’t agree. It’s not that important of a post, so it’s fine, but I think it doesn’t really hook people.
I don’t think the title “We Added Typos to a Benchmark. Then Haiku Saturated It.” communicates what the post is about.
I don’t think it’s about saturation at all. As evidence of this, I ran ctrl-F for “satur” and there are zero mentions of it in the body.
Less critically, “we added typos to a benchmark” is the action rather than the reason, which leaves the reader asking “why is this relevant?”
Got it. I was indeed thinking of framing this blog post more lightheartedly. Would this be better: “We added typos to a benchmark, then Haiku’s scores jumped”? That way there is no mention of “saturation”. I’m thinking the blog post is more “why did Haiku’s score jump” than “LLMs are robust to typos”.
This is true in this context, but it’s worth saying that these models seem to be trained to do well on these benchmarks, so it’s not entirely true that the scores are lower bounds.
The size of the “lower bound” effect is hard to comment on, but if you can provide some input on how much improvement you think comes from tuning your harness, that is a meaningful conclusion. That could be what we build the blog around.
> When Anthropic dropped Opus 4.6, we asked it to figure it out. From our eval logs, Opus 4.6 observed that for a single BigCodeBench prompt, Haiku and Opus often generated multiple code blocks. Furthermore, as typo rate increases, Haiku shifts its behavior for ~20% of its responses from generating multiple code blocks to generating just a single code block.
I also have to say, I don’t know what you mean by “code block”. Does this mean a response?
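For what it’s worth, if “code block” means fenced markdown blocks within a single response (my assumption; the draft should say so explicitly), a minimal counting sketch would look like:

```python
import re

def count_code_blocks(response: str) -> int:
    """Count fenced markdown code blocks (``` ... ```) in one model response."""
    # Non-greedy match so back-to-back fences aren't merged into one block.
    return len(re.findall(r"```.*?```", response, flags=re.DOTALL))

# Hypothetical response: two fenced blocks, so the count is 2.
resp = "Here you go:\n```python\nx = 1\n```\nAnd a test:\n```python\nassert x == 1\n```"
print(count_code_blocks(resp))  # 2
```

Pinning down a definition like this in the post would also make the “~20% shift” claim reproducible.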
> When Anthropic dropped Opus 4.6, we asked it to figure it out.
I think this is a bit informal.
Also it’s odd to say this, because it sounds like you’re saying “this isn’t what we think” or “if it’s wrong, judge it, not us”. You have to claim responsibility for any findings that AI generates.
> We then tested whether this “impossible typo effect” holds for Haiku and Opus on other benchmarks. We chose BBH and GPQA since Haiku struggles reasonably without introducing typos. Here, we no longer observed the impossible typo effect, and Haiku’s capabilities decreased with typo rates.
This should be one figure, not two.
And I think this could be combined with the previous section as well.
> We then tested if other small models have this “impossible typo effect”. We found that, unlike Haiku, the capabilities of GPT-4.1-mini slightly decreased as typos increased.
This plot should have multiple models on the same plot; otherwise the section header is somewhat confusing.
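To make that concrete, a minimal matplotlib sketch of the combined plot. The model names and numbers here are placeholders to illustrate the layout, not real results:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt

# Placeholder data: accuracy vs. typo rate per model (illustrative only).
typo_rates = [0.0, 0.1, 0.2, 0.3, 0.4]
accuracies = {
    "Haiku": [0.30, 0.33, 0.36, 0.38, 0.41],          # hypothetical upward trend
    "GPT-4.1-mini": [0.35, 0.34, 0.33, 0.31, 0.30],   # hypothetical slight decline
}

fig, ax = plt.subplots()
for model, acc in accuracies.items():
    ax.plot(typo_rates, acc, marker="o", label=model)
ax.set_xlabel("Typo rate")
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy vs. typo rate, all models on one axis")
ax.legend()
fig.savefig("typo_rate_models.png")
```

One axis with a legend makes the Haiku-vs-others contrast the first thing a reader sees, which is the whole point of that section.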
> We were curious if LLMs are robust under various typo rates.
Hmm, a couple of thoughts. I don’t think my write-up is perfect, but how about something more like this:
> We were curious if LLMs produce the same response when there are typos in the prompt. To test this, we injected typos into the prompts from BigCodeBench and ran different Claude models. We found that while the accuracy of Opus gradually declined with typo rate, Haiku’s accuracy actually increased as the typo rate increased. This blog investigates this counter-intuitive phenomenon.
(Points I care about:
1. “Robust under various typo rates” doesn’t sound like a linear increase in typos, which is what we actually do.
2. “We double-checked our code and asked Claude to do so too, but we couldn’t find any bugs”: this should be assumed for all your work.
3. “The mystery begins.” This format does sound quite AI-generated.)
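Since point 1 hinges on what “a linear increase in typos” actually means, the post might benefit from a concrete snippet. Here is one plausible sketch of rate-controlled typo injection (adjacent-character swaps; this is my assumption about the method, not necessarily what the eval code does):

```python
import random

def inject_typos(prompt: str, typo_rate: float, seed: int = 0) -> str:
    """Corrupt roughly `typo_rate` of alphabetic characters by swapping each
    selected character with its right neighbor (a simple transposition typo)."""
    rng = random.Random(seed)  # seeded so each prompt corrupts reproducibly
    chars = list(prompt)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and rng.random() < typo_rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Sweeping typo_rate from 0.0 upward gives the "linear increase" x-axis.
print(inject_typos("Write a function that sorts a list.", 0.3, seed=1))
```

Showing something like this in the post would let readers see exactly what a typo rate of, say, 0.3 does to a prompt.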
I will say, now that I’ve fully read this, I don’t fully get why you prefer this title.
It seems like we should say something about (1) what we are measuring or (2) our conclusion. This title leaves me unsure what this post will be about, except for the pace of benchmark saturation, which isn’t even the point of the blog.
I just think it’s a good hook, taking a leaf from how news pieces are written
I would not say this
We should link to the benchmarks; I’m actually not immediately clear what BBH is (though I get it upon further inspection).
I don’t think this is a term you’ve introduced before
I really am fairly opposed to sentences like this
Suggest “observed result in humans” (or cite something that isn’t a BBC article).
It isn’t clear what “squint” means; I think maybe “think” is better.
I think we should move the image to be right after this intro paragraph.